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The effects of aggregation on the reliability of 
measures of academic performance were explored in two studies. In the 
first study, 30 elementary-age children were tested four times on the 
same forms of the Woodcock Reading Mastery Tests and the Ginn 720 
Reading Passage measures. Group stability coefficients, 
within-subject reliability coefficients, and group correlations 
between variables each were calculated on the basis of one or two 
testings and then on the basis of aggregations over four testings. On 
the standardized measure and on the oral passage reading correct rate 
score, aggregation had little impact; however, on the oral passage 
reading error rate score, aggregation substantially increased all 
reliability indices. In the second study, 78 children were tested 10 
times on alternate forms of two reading measures and one written 
expression measure. Group stability coefficients were calculated on 
the basis of 2, 4, 6, 8, and 10 testings. For the oral 
words-in-isolation reading correct score, aggregation had little 
effect, whereas aggregating over occasions and test forms 
dramatically improved the stability of the oral words-in-isolation 
reading error score and the written expression score. The reliability 
and criterion validity of short, simple measures demonstrated their 
suitability as measures of academic performance. (Author/PN) 

Abstract 

The effects of aggregation on the reliability of measures of 
academic performance were explored in two studies. In the first 
study, 30 elementary-age children were tested four times on the same 
forms of three reading measures; group stability coefficients, 
within-subject reliability coefficients, and group correlations 
between variables each were calculated on the basis of one or two 
testings and then on the basis of aggregations over four testings. On 
the standardized measure and on the oral passage reading correct rate 
score, aggregation had little impact; however, on the oral passage 
reading error rate score, aggregation substantially increased all 
reliability indices. In the second study, 78 children were tested 10 
times on alternate forms of two reading measures and one written 
expression measure; group stability coefficients were calculated on 
the basis of 2, 4, 6, 8, and 10 testings. For the oral 
words-in-isolation reading correct score, aggregation had little 
effect, whereas aggregating over occasions and test forms dramatically 
improved the stability of the oral words-in-isolation reading error 
score and the written expression score. Implications for the 
measurement of academic behavior are discussed. 

Use of Aggregation to Improve the Reliability of Simple 
Direct Measures of Academic Performance 

According to the Standards for Educational and Psychological 
Tests (APA, 1972), criterion validity is a broad class of test 
validity that assesses the usefulness of a measure as a predictor of 
other variables. Criterion validity questions typically address the 
suitability of substituting a test for a longer, more cumbersome, or 
more expensive criterion. Therefore, the concern is with verifying 
the existence and strength of useful relationships under applied 
conditions (Messick, 1980). 

Criterion-relatedness is determined by correlational analyses and 
extensions of correlational analyses to multivariate analyses. The 
most elementary example is the correlation of an individual predictor 
test with an individual criterion (Nunnally, 1978), where the strength 
of that correlation specifies the degree of predictive efficiency 
between the measures. In most criterion-related or prediction 
problems, psychometric theorists agree that it is reasonable to expect 
only modest correlations between a criterion and predictor test 
(Nunnally, 1978; Terwilliger, 1980). One reason for these modest 
correlations is the imprecision or unreliability that attenuates 
observed correlations (Stanley, 1971). 
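
The attenuation referred to here is commonly written, in classical test theory, as Spearman's attenuation formula. The symbols below are assumptions introduced for illustration and do not appear in the original report: r_xy is the observed predictor-criterion correlation, rho_{T_x T_y} is the correlation between true scores, and r_xx and r_yy are the reliabilities of the two measures.

```latex
r_{xy} = \rho_{T_x T_y} \sqrt{r_{xx}\, r_{yy}}
```

Because each reliability is at most 1, the observed correlation cannot exceed the true-score correlation in this model; raising either reliability raises the ceiling on the observed correlation.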

In studies of criterion validity, one method commonly employed to 
reduce random error and simultaneously to improve the extent to which 
true relationships are observed is to increase the sample size. 
However, as Epstein (1980) makes clear, a fundamental but widely 
ignored alternative strategy is to aggregate observations over 
situations and/or occasions. The law of sampling distributions holds 
that behavior aggregated over stimuli or occasions as well as over 
individuals should reduce measurement error and improve the basis for 
establishing reliable, generalizable relationships. 
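
The claim that aggregation over occasions reduces measurement error can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the procedure used in the studies reported here: each simulated student has a fixed true score, every observation adds independent random error, and the function names, sample size, and error standard deviation are all hypothetical choices made for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulated_stability(k, n_subjects=500, noise_sd=2.0, seed=0):
    """Correlate two independent k-occasion averages of noisy scores.

    Assumption: true scores are standard normal and each occasion adds
    independent normal error with standard deviation noise_sd.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]

    def average(true_score):
        # Mean of k noisy observations of the same true score.
        return sum(true_score + rng.gauss(0.0, noise_sd) for _ in range(k)) / k

    first = [average(t) for t in true_scores]
    second = [average(t) for t in true_scores]
    return pearson(first, second)


# Stability of single observations versus 2- and 5-occasion averages.
stability = {k: simulated_stability(k) for k in (1, 2, 5)}
print(stability)
```

Under these assumptions the correlation between the two averages rises as more occasions are averaged, because the random error in each average shrinks while the true-score variance stays the same.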

In a series of four studies, Epstein (1979) demonstrated that 
aggregating over occasions, in fact, did render more reliable 
correlations. He found that when a wide range of personality measures 
each were averaged over an increasing number of occasions, stability 
coefficients, indicative of a measure's reliability or precision, 
increased to high levels. In these studies, Epstein found that 
relations between variables observed on one occasion were lower than, 
and sometimes opposite from, relations between the same variables 
observed and averaged over several occasions. This pattern held not 
only for personality measures, but also for direct observations of 
behavior and even a physiological index of heart rate. 

The two experiments reported here examined the hypothesis that 
this phenomenon may apply to the measurement of academic behaviors. 
These investigations are relevant for educational measurement, in 
general, because they provide information concerning how to measure 
more accurately students' academic performance. More specifically, 
they are relevant for frequent measurement and continuous time-series 
evaluation strategies, where the practice of aggregating performance 
across occasions and/or test forms is routine, but where the frequency 
with which measurement need occur is unclear. Results of these 
studies should provide practitioners, who measure student performance 
on goals frequently and who formatively evaluate student programs, 
with information concerning how many data points are necessary before 
reliable and valid estimates of student performance are achieved. 

The first experiment explored questions related to the 
measurement of reading behavior on the same test sampled over 
occasions. The second investigated issues concerning the measurement 
of reading and written expression performance when behaviors are 
sampled over occasions and over parallel test forms. 

Study 1 

Study 1 posed three questions. First, it asked: How does 
aggregating students' scores on a test administered on several 
occasions affect stability in the measurement of academic performance? 
The study compared stability coefficients for reading behavior 
measured and averaged over two occasions with coefficients for the 
same behavior measured and averaged over four occasions. 

The second question addressed in Study 1 was: If aggregation 
improves the stability of academic measures as explored through 
correlational analyses, then to what extent does it allow one to 
predict more accurately an individual's true score? To explore this 
question, within-subject reliability coefficients were examined, with 
subjects' behavior first observed on two occasions, then observed on 
and averaged over four occasions. 

Question 3 in this study explored: How does aggregation over 
testing occasions affect the strength of relations between measures of 
academic performance? Specifically, the study compared the strength 
of relations between two reading behaviors when the data were 
collected on a single occasion with the strength of relations when 
data were collected on and averaged within subjects over four 
occasions. 

Method 

Subjects. Ninety English-speaking students, distributed across 
the six elementary grade levels, were selected randomly from one 
midwestern metropolitan school for inclusion in a separate study. 
From this pool of 90 students to whom the dependent measures were 
administered as part of a larger battery of tests, 30 subjects (M=15, 
F=15) evenly distributed among grades 1-6 were selected randomly. 

Measures. The measures were: (a) from the Woodcock Reading 
Mastery Tests (Woodcock, 1973), the Word Identification Test of Form A 
(WRMT); and (b) from the Ginn 720 reading series, a 200-word passage, 
representative of the average readability (3.6) from the last 25% of 
level 8. (See Fuchs & Deno, 1981, for passage selection procedure.) 

Procedure. According to a standard format, the 30 students were 
tested individually four times by a trained examiner. On one of these 
occasions, the measures were administered within a larger battery of 
tests; this testing session was approximately 60 minutes. Each of the 
other three sessions lasted approximately 10 minutes. Each student 
was assigned randomly to one of four groups, each of which received 
the longer battery at a different point in the sequence of the four 
administrations. Additionally, the order in which the measures were 
administered within a test session was random. 

Data analyses. The data were subjected to three analyses. The 
first analysis was to obtain group stability coefficients within 
variables. These coefficients were obtained for the following 
variables: (a) the WRMT raw score, (b) the words correct per minute 
score on the Ginn 720 reading passage, and (c) the errors per minute 
score on the Ginn 720 reading passage. Odd-even stability 
coefficients (Epstein, 1980), averaged first across two days 
(correlation between behavior on Day 1 and behavior on Day 2) and then 
across four days (correlation between behavior averaged over Days 1 
and 3 and behavior averaged over Days 2 and 4), were calculated and 
compared. 
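
The odd-even stability coefficient described above can be sketched in a few lines. This is an illustrative reading of the description, not the authors' actual computation: the function names are hypothetical, and the example scores are invented solely to show the calculation and are not data from the study.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def odd_even_stability(scores, n_days):
    """Correlate each student's odd-day mean with the even-day mean.

    scores: one list per student, ordered Day 1, Day 2, ...
    n_days: how many days to use (must be even), e.g. 2 or 4.
    """
    if n_days % 2 != 0:
        raise ValueError("n_days must be even")
    half = n_days // 2
    odd = [sum(row[0:n_days:2]) / half for row in scores]   # Days 1, 3, ...
    even = [sum(row[1:n_days:2]) / half for row in scores]  # Days 2, 4, ...
    return pearson(odd, even)


# Hypothetical scores for three students over four days (not study data).
example_scores = [
    [10, 14, 14, 10],
    [20, 18, 18, 20],
    [30, 33, 33, 30],
]
r2 = odd_even_stability(example_scores, 2)  # Day 1 versus Day 2
r4 = odd_even_stability(example_scores, 4)  # mean(Days 1, 3) versus mean(Days 2, 4)
print(r2, r4)
```

In this invented example the day-to-day fluctuations cancel when two days are averaged, so the 4-day coefficient is higher than the 2-day coefficient; real data would show a smaller and less tidy gain.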

A second analysis was conducted to obtain within-subject 
reliability coefficients. For the Ginn 720 correct per minute score 
and the Ginn 720 error per minute score, a reliability coefficient 
(percentage of overlap) was calculated between (a) Day 1 and Day 2, 
and (b) the average of Days 1 and 3 and the average of Days 2 and 4. 
These coefficients were compared for each variable. 

A third analysis examined group correlations between variables. 
Correlations were calculated between (a) the WRMT raw score and the 
words per minute correct score on the Ginn passages, and (b) the WRMT 
raw score and the error per minute score on the Ginn passages. First, 
these correlations were based on each subject's performance on the 
first occasion. Then, the correlations were recalculated on the basis 
of the average of each subject's performance on each variable across 
the four occasions. The strength of relations based on one occasion 
was compared to the strength of relations based on four occasions. 
Results 

Question 1: How does aggregating students' scores on a test 
administered on several occasions affect stability in the measurement 
of academic performance? As displayed in Table 1, 2-day and 4-day 
group stability coefficients for the three dependent variables were 
statistically significant (p < .001). The correlations for the WRMT 
raw score and the Ginn words correct score were high and similar; 
correlations were low for the Ginn error rate score. This indicates 
greater precision or reliability for the WRMT and correct rate scores 
relative to the error rate score. 



Insert Table 1 about here 



Within each measure, stability coefficients increased from 2-day 
to 4-day aggregations. The 2-day error rate coefficient initially was 
.18 (22%) lower than both the 2-day WRMT and the 2-day correct rate 
coefficients. However, the error rate 4-day coefficient improved .15 
(19%) over its 2-day coefficient while the WRMT and correct rate 
coefficients remained nearly the same. Therefore, the 4-day error 
rate was very similar to the 4-day WRMT and the 4-day correct rate 
coefficients. It appears, then, that aggregation positively affected 
the reliability of error rate scores; it had no impact on the WRMT or 
the correct rate scores. 

Question 2: To what extent does aggregation allow one to predict 
more accurately an individual's true score? Table 2 displays, for 
each measure, the within-subject reliability coefficients: (a) the 
mean percentage of overlap between scores on Day 1 and Day 2, (b) the 
mean percentage of overlap between the average of scores on Days 1 and 
3 with the average of scores on Days 2 and 4, and (c) the mean 
within-subject changes between 2-day and 4-day coefficients. As with 
the stability coefficients, these mean reliability coefficients were 
highest for the WRMT and lowest for the Ginn error rate measures. 
Again, small differences were noted between the 2-day and 4-day WRMT 
coefficients; the difference was slightly larger for Ginn correct rate 
and largest for Ginn error rate. Mean within-subject changes were 
ordered in a similar manner. Therefore, whereas the Ginn 2-day error 
rate coefficient was .31 (47%) below the WRMT 2-day coefficient, the 
error rate 4-day coefficient was only .25 (34%) below the WRMT 4-day 
coefficient. It appears that aggregation allows one to predict an 
individual's score more accurately for error rate, but has little 
effect on WRMT or correct rate scores. 
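
The report does not state how the percentages in parentheses were computed. The sketch below checks one plausible reading, namely that each gap is expressed as a percentage of the lower coefficient; this is an inference, not a documented method, and the helper name is hypothetical. The coefficients are the WRMT and Ginn error rate values from Table 2.

```python
def gap(lower, higher):
    """Return the absolute gap and the gap as a percentage of the lower value.

    Assumption: the report's percentages are relative to the lower
    coefficient; the text does not state this explicitly.
    """
    diff = higher - lower
    return round(diff, 2), 100 * diff / lower


# Table 2: Ginn error rate versus WRMT within-subject coefficients.
two_day = gap(0.65, 0.96)   # 2-day coefficients
four_day = gap(0.72, 0.97)  # 4-day coefficients
print(two_day, four_day)
```

Under this reading the gaps are .31 and .25, with percentages that fall between 47% and 48% and between 34% and 35%, which agrees with the reported 47% and 34% if the report dropped the decimals rather than rounding.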

Question 3: How does aggregation over testing occasions affect 
the relation between measures of academic performance? Two sets of 
correlations were computed between (a) WRMT raw score and Ginn words 
correct rate score, and (b) WRMT raw score and Ginn error rate score. 
The first set was based on scores on one day; the second set was based 
on the average score across the four occasions. The correlations and 
their p-values are displayed in Table 3. All correlations were 
statistically significant. For the stable measures, WRMT and Ginn 
words correct rate scores, the 1-day coefficient was high (.91) and 
remained at approximately the same level when calculated on the basis 
of four days (.89). However, the correlation between WRMT scores and 
the least stable measure of error rate based on one day (-.46) 
increased 18% when calculated on the basis of four days (-.53). As 
with the other analyses, then, it appears that aggregation affects the 
strength of relation when error rate is involved, but does not affect 
the strength of relation when correct rate is involved. 



Insert Table 3 about here 



Discussion 

The WRMT and the Ginn correct scores initially were precise, 
reliable measures, as evidenced by all three statistics: the 2-day 
group stability coefficients, the 2-day within-subject reliability 
coefficients, and the 1-day correlation between the WRMT and Ginn 
correct rate scores. For these initially reliable measures, 
aggregating on the same test over occasions made no important 
contribution to the measures' stability or to the strength of the 
relations between measures. 

However, aggregating on the same test over occasions appeared to 
have an important effect on the least stable measure, the Ginn error 
rate. Aggregating over four days substantially enhanced the error 
rate group stability coefficients, the within-subject reliability 
coefficients, and the strength of relation between variables. 

Additionally, the finding that error rate, the least reliable 
measure, manifested an initially weak relation with other measures 
corroborates other studies of criterion-relatedness between simple 
measures and achievement tests (Deno, Mirkin, Chiang, & Lowry, 1980; 
Fuchs & Deno, 1981). However, this study suggests that when 
performance is sampled and aggregated across time, as is routinely 
done in frequent measurement and continuous evaluation, error rate 
becomes a more stable, reliable, precise measure and its criterion 
validity with other measures improves. 

Study 2 

While the effects of sampling on the same test form over 
occasions were explored in Study 1, the impact of sampling on parallel 
test forms over occasions was examined in Study 2. By aggregating 
performance across stimuli (test forms) in addition to aggregating 
performance over occasions, two types of error in pupils' scores 
potentially are reduced. First, with respect to aggregation across 
stimuli, the unique effects associated with particular stimuli are 
cancelled relative to their contribution to the test concept/skill on 
which all items converge. Second, aggregating over occasions cancels 
incidental effects associated with specific sessions. Both types of 
aggregation should enhance the reliability of a measure and increase 
the replicability of findings (Epstein, 1980). Therefore, the purpose 
of the second study was to examine the effect of aggregation across 
both test forms and occasions on group stability coefficients for 
academic measures. 

Method 

Subjects. Subjects were 78 children (M=48, F=30) selected from 
three public schools in a midwest metropolitan area. Each child, 
selected as "high-risk" for receiving special education services, 
scored at or below the 15th percentile on a short duration measure of 
written expression within his/her grade level (see measurement 
procedures in Deno, Marston, & Mirkin, 1982). The numbers of children 
in grades 3-6, respectively, were 26, 17, 19, and 21. 

Procedure. Once per week over a 10-week period, an alternate 
form of an oral word reading measure was administered individually to 
each child (Deno et al., 1980). Each alternate form was generated by 
randomly selecting words from the third grade level of the 
Harris-Jacobson Word List (Harris & Jacobson, 1972). The children's 
task was to read aloud words for one minute while the examiner 
recorded errors. Words read correctly per minute and errors per 
minute were scored. 

During each testing session, a writing sample also was obtained. 
For this measure of written expression, each student was presented 
with an alternate form of a story starter each week and required to 
write on the story topic for three minutes. Number of correctly 
spelled words was scored. 

Data analysis. Group stability coefficients were calculated for 
the reading word correct rate score, the error rate score, and the 
written expression measure score. The odd-even stability coefficients 
first were averaged across two observations (correlation between 
behavior on Week 1 and behavior on Week 2), then across four 
observations (correlation between behavior averaged over Weeks 1 and 3 
and behavior averaged over Weeks 2 and 4), then across six 
observations (the average behavior over Weeks 1, 3, and 5 correlated 
with the average behavior over Weeks 2, 4, and 6), then across eight 
observations, and finally across 10 observations. Within variables, 
these correlations were compared. 
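
The sequence of odd-even coefficients described above can be written compactly. The notation is an assumption introduced here for illustration: X_{i,w} denotes student i's score in week w, and the correlation is taken across students. The text spells out only the 2-, 4-, and 6-observation cases; the 8- and 10-observation cases are assumed to follow the same pattern.

```latex
r_{2m} = \operatorname{corr}_{i}\!\left(
  \frac{1}{m}\sum_{j=1}^{m} X_{i,\,2j-1},\;
  \frac{1}{m}\sum_{j=1}^{m} X_{i,\,2j}
\right), \qquad m = 1, 2, 3, 4, 5
```

With m = 1 this reduces to the correlation between Week 1 and Week 2; with m = 2 it is the correlation between the mean of Weeks 1 and 3 and the mean of Weeks 2 and 4.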
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Results 

Table 4 displays 2, 4, 6, 8, and 10-day group stability 
coefficients for the three dependent variables. All correlations were 
statistically significant, and were consistently higher for the 
reading words correct score than for the reading error score or the 
written expression score. 



Insert Table 4 about here 



Within each measure, stability coefficients increased as the 
number of observations increased. The 2-day reading error rate 
coefficient initially was .69 (280%) lower than the 2-day correct rate 
coefficient; yet, the difference between the correct and error rate 
coefficients decreased as the number of observations increased so 
that, when coefficients were based on 10 observations, the error rate 
correlation was only .12 (13%) lower than the correct rate 
correlation. Consequently, the stability coefficient for the error 
rate score improved dramatically, by .62 (254%), as the number of 
observations increased. 

Similarly, the 2-day written expression coefficient was .39 (70%) 
lower than the 2-day reading words correct coefficient. Again, the 
difference between the reading words correct and written expression 
coefficients decreased as the number of observations increased. When 
coefficients were based on 10 observations, the written expression 
correlation was only .10 (11%) lower. It appears, then, that 
aggregation over test forms and occasions dramatically affects oral 
reading error and written expression stability coefficients, but does 
not affect correct oral reading stability. 
Discussion 

The correct rate oral reading score again was an initially 
precise measure as evidenced by the group stability coefficients. For 
this initially reliable measure, aggregating over alternate forms of a 
test and over occasions made no real contribution to the measure's 
stability. However, as in Study 1, oral reading error rate was 
initially quite imprecise. Additionally, the written expression score 
initially was unreliable. Aggregating over alternate forms of a test 
and over occasions had a dramatic effect on these unstable measures, 
enhancing their stability to well within an acceptable level of 
alternate-form/test-retest reliability when the stability coefficients 
were based on aggregations over 10 observations. 

Implications 

The results of these two studies have several implications for 
the measurement of academic behavior. First, it appears that some 
academic behaviors initially are measured precisely. The WRMT, by all 
indices, rendered reliable student scores even when measurement was 
based on one observation. Given the documented strong psychometric 
adequacy of the WRMT, this may not be surprising. However, an 
interesting finding of these studies is that the simple, short 
duration measures of either oral correct word reading or oral correct 
passage reading were very precise, just as precise as the WRMT, when 
measurement was based on one occasion and/or on one test form. For 
these behaviors, aggregating on the same test over occasions had 
little or no effect on group stability coefficients, on within-subject 
reliability, or on the strength of relations with other reliably 
observed behaviors. Similarly, for these initially precise measures, 
aggregating over alternate forms of the same test and over occasions 
did not affect group stability coefficients. 

A second implication of these studies, nevertheless, is that 
other academic behaviors, such as the error Ginn passage reading 
measure, the error word reading measure, and the written expression 
measure, are not measured reliably on the same test form on one 
occasion. For those behaviors, aggregating over occasions had a 
positive impact on group stability coefficients, on within-subject 
reliability, and on the strength of relations between variables; 
similarly, aggregating over alternate test forms and over occasions 
dramatically affected group stability coefficients. Therefore, for 
certain academic behaviors, sampling on the same test form across time 
or on alternate test forms across time provides more precise 
information. This suggests the importance of aggregating a student's 
academic test performance across observations and/or test forms for 
certain behaviors, in order to ensure accurate information for 
decision making. These studies indicate a minimum of 5 to 10 data 
points are required for reliable estimation of children's performance 
on relatively imprecise measures such as oral reading errors or a 
written expression measure. As teachers increasingly use 
curriculum-based measurement to formulate decisions about students' 
progress toward goals, they might well consider aggregation as a means 
of improving the accuracy of their estimates of student performance and 
the decisions they make. 

Nevertheless, results of this study suggest that certain very 
simple, short duration academic measures, such as a one-minute correct 
oral word reading task and a one-minute correct oral passage reading 
test, are very stable and correlate highly with more elaborate, 
global, norm-referenced standardized tests such as the WRMT. Results 
of these studies demonstrate the reliability and criterion validity of 
such short, simple measures, and suggest the suitability of 
substituting them for more elaborate and time-consuming measures of 
academic performance. 

Table 1 

Group Stability Coefficients^a (N=30) 

Measures 

Woodcock Word Identification Test - raw score 

Ginn 720, 3rd grade reading passage - words correct per minute 

Ginn 720, 3rd grade reading passage - errors per minute 

^a All correlations are statistically significant (p < .001). 

Table 2 

Within Subject Reliability Coefficients (N=30) 

                     2-day         4-day         Within-subject change from 
Measure              coefficient   coefficient   2-day to 4-day coefficient 

WRMT                 .96           .97           .012 
Ginn Correct Rate    .85           .88           .036 
Ginn Error Rate      .65           .72           .080 

Table 3 

Correlations Between Variables Calculated on One-Day Scores 
and on the Means of Four-Day Scores (N=30) 

                              Correlations and p-values 
                              1-day                   4-day 
Between                       coefficient   p-value   coefficient   p-value 

WRMT and Ginn Correct Rate    .91           .001      .89           .001 
WRMT and Ginn Error Rate      -.46          .011      -.54          .003 

Table 4 

Two, Four, Six, Eight, and Ten-Day/Observation Stability Coefficients^a (N=78) 

                              Stability Coefficients 
                              2-            4-            6-            8-            10- 
                              Observation   Observation   Observation   Observation   Observation 

Reading Words Correct Rate    .94           .96           .98           .98           .99 
Reading Error Rate            .25           .58           .75           .83           .87 
Writing Words                 .55           .72           .85           .88           .89 

^a All correl 